Feature description

  • instant: record index
    • dteday : date
    • season : season (1: spring, 2: summer, 3: fall, 4: winter)
    • yr : year (0: 2011, 1:2012)
    • mnth : month ( 1 to 12)
    • hr : hour (0 to 23)
    • holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
    • weekday : day of the week
    • workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
    • weathersit :
      • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
      • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
      • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
      • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
    • temp : Normalized temperature in Celsius. The values are divided by 41 (max)
    • atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
    • hum: Normalized humidity. The values are divided by 100 (max)
    • windspeed: Normalized wind speed. The values are divided by 67 (max)
    • casual: count of casual users
    • registered: count of registered users
    • cnt: count of total rental bikes including both casual and registered

As for any rental business, it is important to know the demand the service may face, with the obvious goal of optimizing costs. In our example, a bike rental company, it can be useful to predict the number of bikes rented per hour: to anticipate stock, staff on duty, or to position bikes in areas of the city with high demand. With this in mind, we will try to understand which features impact the number of bikes rented per hour.

Data Overview

In [1]:
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt

file = pd.read_csv("data.csv")
file
Out[1]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0000 3 13 16
1 2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0000 8 32 40
2 3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0000 5 27 32
3 4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0000 3 10 13
4 5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0000 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17374 17375 2012-12-31 1 1 12 19 0 1 1 2 0.26 0.2576 0.60 0.1642 11 108 119
17375 17376 2012-12-31 1 1 12 20 0 1 1 2 0.26 0.2576 0.60 0.1642 8 81 89
17376 17377 2012-12-31 1 1 12 21 0 1 1 1 0.26 0.2576 0.60 0.1642 7 83 90
17377 17378 2012-12-31 1 1 12 22 0 1 1 1 0.26 0.2727 0.56 0.1343 13 48 61
17378 17379 2012-12-31 1 1 12 23 0 1 1 1 0.26 0.2727 0.65 0.1343 12 37 49

17379 rows × 17 columns

In [2]:
print(file.info(),"\n")
print(file.isna().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB
None 

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

DataViz & Statistical tests

In [3]:
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
In [4]:
fig, ax = plt.subplots(figsize=(20,15))
matrix = np.triu(file.corr())
cmap = sns.diverging_palette(220, 20, sep=20, as_cmap=True)
sns.heatmap(file.corr(),annot = True, fmt='.1g', mask=matrix, ax=ax, cmap=cmap)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1d386890>

We can see a few strong correlations, in particular between cnt and the casual and registered counts. Since cnt is the sum of casual and registered, using them as predictors would leak the target and overfit the model.

In addition, the features temp and atemp have a correlation close to 1, so we can keep only one of them to represent both in the model. This will also make the model faster to train.

Season and month, unsurprisingly, are highly correlated.
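One way to act on these observations is to drop one feature from every highly correlated pair automatically. A minimal sketch on a toy frame (the values below are made up; the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the dataset (made-up values, same column names)
df = pd.DataFrame({
    "temp":  [0.24, 0.22, 0.30, 0.46, 0.58],
    "atemp": [0.2879, 0.2727, 0.3333, 0.4545, 0.5455],
    "hum":   [0.81, 0.40, 0.80, 0.55, 0.35],
})

# Upper triangle of the absolute correlation matrix (diagonal excluded)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one column from every pair correlated above 0.95
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # ['atemp']
```

On the real data this would flag atemp (near-perfectly correlated with temp) while keeping weaker pairs such as season/mnth for a manual decision.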

In [5]:
fig = px.violin(file, x="weekday", y="casual", box=True)
fig.show()

from scipy import stats

print('Average :',file.groupby(file.weekday)['casual'].mean())
print('Standard Deviation :',file.groupby(file.weekday)['casual'].std())
fig = px.violin(file, x="weekday", y="registered", box=True)
fig.show()
print('Average :',file.groupby(file.weekday)['registered'].mean())
print('Standard Deviation :',file.groupby(file.weekday)['registered'].std())
Average : weekday
0    56.163469
1    28.553449
2    23.580514
3    23.159192
4    24.872521
5    31.458786
6    61.246815
Name: casual, dtype: float64
Standard Deviation : weekday
0    68.090663
1    35.097056
2    26.170895
3    27.790658
4    27.768088
5    36.487534
6    77.020582
Name: casual, dtype: float64
Average : weekday
0    121.305356
1    155.191206
2    167.658377
3    167.971313
4    171.564144
5    164.677121
6    128.962978
Name: registered, dtype: float64
Standard Deviation : weekday
0    105.972899
1    159.517897
2    170.103245
3    172.344752
4    169.327395
5    149.905977
6    108.600931
Name: registered, dtype: float64

Take day 0 (Sunday) for the "registered" case: with a mean of 121 rentals and a standard deviation of 106, there is roughly a 68% chance that the number of rentals falls between 121 - 106 and 121 + 106, i.e. between 15 and 227.

The other days follow a similar pattern. This is a fairly large spread, so we can suspect that the dataset contains a number of outliers, and that the median is a more robust summary than the mean in our case.
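The point about the median being more robust than the mean can be illustrated with a few made-up hourly counts:

```python
import numpy as np

# Made-up hourly rental counts: mostly moderate, with two event-driven spikes
counts = np.array([90, 110, 105, 95, 120, 100, 115, 600, 720])

# The spikes drag the mean far above the typical value,
# while the median stays with the bulk of the data
print(counts.mean())       # ≈ 228.3
print(np.median(counts))   # 110.0
```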

In [6]:
# Drop outlier rows using the 3-sigma rule; a trimmed mean could be an alternative
factor = 3

upper_lim = file['temp'].mean () + file['temp'].std () * factor
lower_lim = file['temp'].mean () - file['temp'].std () * factor
file = file[(file['temp'] < upper_lim) & (file['temp'] > lower_lim)]

upper_lim = file['atemp'].mean () + file['atemp'].std () * factor
lower_lim = file['atemp'].mean () - file['atemp'].std () * factor
file = file[(file['atemp'] < upper_lim) & (file['atemp'] > lower_lim)]

upper_lim = file['hum'].mean () + file['hum'].std () * factor
lower_lim = file['hum'].mean () - file['hum'].std () * factor
file = file[(file['hum'] < upper_lim) & (file['hum'] > lower_lim)]

upper_lim = file['windspeed'].mean () + file['windspeed'].std () * factor
lower_lim = file['windspeed'].mean () - file['windspeed'].std () * factor
file = file[(file['windspeed'] < upper_lim) & (file['windspeed'] > lower_lim)]

upper_lim = file['registered'].mean () + file['registered'].std () * factor
lower_lim = file['registered'].mean () - file['registered'].std () * factor
file = file[(file['registered'] < upper_lim) & (file['registered'] > lower_lim)]

upper_lim = file['casual'].mean () + file['casual'].std () * factor
lower_lim = file['casual'].mean () - file['casual'].std () * factor
file = file[(file['casual'] < upper_lim) & (file['casual'] > lower_lim)]

file = file.reset_index()
file = file.drop(['index'], axis=1)
file
Out[6]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0000 3 13 16
1 2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0000 8 32 40
2 3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0000 5 27 32
3 4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0000 3 10 13
4 5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0000 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16411 17375 2012-12-31 1 1 12 19 0 1 1 2 0.26 0.2576 0.60 0.1642 11 108 119
16412 17376 2012-12-31 1 1 12 20 0 1 1 2 0.26 0.2576 0.60 0.1642 8 81 89
16413 17377 2012-12-31 1 1 12 21 0 1 1 1 0.26 0.2576 0.60 0.1642 7 83 90
16414 17378 2012-12-31 1 1 12 22 0 1 1 1 0.26 0.2727 0.56 0.1343 13 48 61
16415 17379 2012-12-31 1 1 12 23 0 1 1 1 0.26 0.2727 0.65 0.1343 12 37 49

16416 rows × 17 columns
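The six nearly identical filtering blocks in the cell above can be collapsed into a single helper. A sketch on a toy frame with made-up values and one obvious windspeed outlier:

```python
import numpy as np
import pandas as pd

def drop_sigma_outliers(df, columns, factor=3):
    """Keep only rows within mean ± factor*std for each column, in turn."""
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        lower, upper = mean - factor * std, mean + factor * std
        df = df[(df[col] > lower) & (df[col] < upper)]
    return df.reset_index(drop=True)

# Toy frame (made-up values); 5.0 is far outside the normalized [0, 1] range
df = pd.DataFrame({
    "windspeed": np.append(np.linspace(0.10, 0.28, 19), 5.0),
    "hum": np.linspace(0.40, 0.78, 20),
})
clean = drop_sigma_outliers(df, ["windspeed", "hum"])
print(len(df), "->", len(clean))  # 20 -> 19
```

Note that filtering column by column, as the original cell does, recomputes the mean and std on the already-filtered frame at each step, so the order of the columns can slightly change the result.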

In [7]:
day = file.groupby(['weekday']).sum()
day = pd.DataFrame(day)
day = day.reset_index()

fig = go.Figure()

fig.update_layout(
    title="Distribution of bike rentals per day",
    xaxis_title="Day",
    yaxis_title="Number of bikes",
)

fig.add_trace(go.Scatter(
    x=day['weekday'],
    y=day['cnt'],
))

fig.show()

We can see a gradual increase in service usage over the course of the week, with a peak on Saturday that could correspond to heavier usage by "casual" users.

In [8]:
trip = file.groupby(file.hr)['cnt'].sum()
trip = pd.DataFrame(trip)
trip = trip.reset_index()

fig = go.Figure()

fig.update_layout(
    title="Distribution of bike rentals per hour",
    xaxis_title="Hour",
    yaxis_title="Number of bikes",
)

fig.add_trace(go.Scatter(
    x=trip['hr'],
    y=trip['cnt'],
))

fig.show()

First of all, we can observe that service usage evolves with the time of day, with two sharp rises at 8:00 and 19:00, corresponding to the start and end of the workday.

In [9]:
trip = file
trip = trip[trip['workingday'] == 0]
trip = trip.groupby(trip.hr)['cnt'].sum()

trip2 = file
trip2 = trip2[trip2['workingday'] == 1]
trip2 = trip2.groupby(trip2.hr)['cnt'].sum()

trip = pd.DataFrame(trip)
trip2 = pd.DataFrame(trip2)
trip = trip.reset_index()
trip2 = trip2.reset_index()

fig = go.Figure()

fig.update_layout(
    title="Distribution of bike rentals per hour",
    xaxis_title="Hour",
    yaxis_title="Number of bikes",
)

fig.add_trace(go.Scatter(
    x=trip['hr'],
    y=trip['cnt'],
    name="Weekend"
))

fig.add_trace(go.Scatter(
    x=trip2['hr'],
    y=trip2['cnt'],
    name="Working Day"
))

fig.show()

On weekends, usage is fairly uniform between 10:00 and 20:00, whereas on working days the curve follows the same shape as the plot just above, with sharper and more pronounced swings. In the first plot, the dip between 9:00 and 16:00 is less marked because it is offset by the weekend figures, which rise steadily where the weekdays dip.

For predicting rentals, it will therefore be important to know whether we are on a working day or not.

In [10]:
casu = file.groupby(file.weekday)['casual'].sum()
regi = file.groupby(file.weekday)['registered'].sum()
casu = pd.DataFrame(casu)
casu = casu.reset_index()
regi = pd.DataFrame(regi)
regi = regi.reset_index()

fig = go.Figure()
fig.add_trace(go.Bar(
    x=regi['weekday'],
    y=regi['registered'],
    name='Registered Clients',
    marker_color='#12c2e9'
))
fig.add_trace(go.Bar(
    x=casu['weekday'],
    y=casu['casual'],
    name='Casual Clients',
    marker_color='#c471ed'
))

fig.update_layout(barmode='group', xaxis_tickangle=-45, title_text="Distribution of features 'Casual' & 'Registered' per day")
fig.show()

casu = file.groupby(file.mnth)['casual'].sum()
regi = file.groupby(file.mnth)['registered'].sum()
casu = pd.DataFrame(casu)
casu = casu.reset_index()
regi = pd.DataFrame(regi)
regi = regi.reset_index()

fig = go.Figure()
fig.add_trace(go.Bar(
    x=regi['mnth'],
    y=regi['registered'],
    name='Registered Clients',
    marker_color='#12c2e9'
))
fig.add_trace(go.Bar(
    x=casu['mnth'],
    y=casu['casual'],
    name='Casual Clients',
    marker_color='#c471ed'
))

fig.update_layout(barmode='group', xaxis_tickangle=-45, title_text="Distribution of features 'Casual' & 'Registered' per month")
fig.show()

casu = file.groupby(file.season)['casual'].sum()
regi = file.groupby(file.season)['registered'].sum()
casu = pd.DataFrame(casu)
casu = casu.reset_index()
regi = pd.DataFrame(regi)
regi = regi.reset_index()

fig = go.Figure()
fig.add_trace(go.Bar(
    x=regi['season'],
    y=regi['registered'],
    name='Registered Clients',
    marker_color='#12c2e9'
))
fig.add_trace(go.Bar(
    x=casu['season'],
    y=casu['casual'],
    name='Casual Clients',
    marker_color='#c471ed'
))

fig.update_layout(barmode='group', xaxis_tickangle=-45, title_text="Distribution of features 'Casual' & 'Registered' per season")
fig.show()

casu = file.groupby(file.yr)['casual'].sum()
regi = file.groupby(file.yr)['registered'].sum()
casu = pd.DataFrame(casu)
casu = casu.reset_index()
regi = pd.DataFrame(regi)
regi = regi.reset_index()

fig = go.Figure()
fig.add_trace(go.Bar(
    x=regi['yr'],
    y=regi['registered'],
    name='Registered Clients',
    marker_color='#12c2e9'
))
fig.add_trace(go.Bar(
    x=casu['yr'],
    y=casu['casual'],
    name='Casual Clients',
    marker_color='#c471ed'
))

fig.update_layout(barmode='group', xaxis_tickangle=-45, title_text="Distribution of features 'Casual' & 'Registered' per year")
fig.show()

Users are overwhelmingly subscribers to the service. Saturday and Sunday, as well as the warm seasons and months, see a rise in non-subscribed clients, but it remains modest.

The distribution per day, except for Saturday and Sunday, is fairly similar.

The service seems to be successful: from one year to the next, the number of rentals grew from 1.2M to 2M, an increase of 66.67%.
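The quoted growth figure can be checked directly (the yearly totals are the round numbers read off the bar chart):

```python
# Approximate yearly rental totals read off the bar chart (round numbers)
rentals_2011 = 1_200_000
rentals_2012 = 2_000_000

growth = (rentals_2012 - rentals_2011) / rentals_2011
print(f"{growth:.2%}")  # 66.67%
```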

In [11]:
day = file.groupby(['hum']).sum()
day = pd.DataFrame(day)
day = day.reset_index()

fig = go.Figure()

fig.update_layout(
    title="Distribution of bike rentals as a function of humidity",
    xaxis_title="Humidity",
    yaxis_title="Number of bikes",
)

fig.add_trace(go.Scatter(
    x=day['hum'],
    y=day['cnt'],
))

fig.show()

No clear pattern emerges regarding the influence of humidity on the number of rentals.

In [12]:
day = file.groupby(['windspeed']).sum()
day = pd.DataFrame(day)
day = day.reset_index()

fig = go.Figure()

fig.update_layout(
    title="Distribution of bike rentals as a function of wind speed",
    xaxis_title="Windspeed",
    yaxis_title="Number of bikes",
)

fig.add_trace(go.Scatter(
    x=day['windspeed'],
    y=day['cnt'],
))

fig.show()

We can see that service usage declines as the wind becomes stronger. The drop between 0 and 0.9 can be explained by the fact that it is rare to have no wind or very little wind.

In [13]:
fig = px.sunburst(file, path=['weekday','weathersit'], values='cnt')
fig.show()

Regardless of the day of the week, users do not use the service beyond weather level 2, i.e. when it rains.

In [14]:
from sklearn.cluster import KMeans
from sklearn import cluster
from sklearn.preprocessing import StandardScaler

temp = file.groupby(file.temp)['cnt'].sum()
temp = pd.DataFrame(temp)
temp = temp.reset_index()

fig = px.bar(temp, x='temp', y='cnt',color='cnt',height=400)
fig.show()

x = temp['temp']
y = temp['cnt']

temp = StandardScaler().fit_transform(temp)
kmeans = KMeans(n_clusters=3)
kmeans.fit(temp)
cluster = kmeans.predict(temp)
plt.scatter(x,y, c=cluster)
Out[14]:
<matplotlib.collections.PathCollection at 0x1a1f5c0e90>

The green cluster, corresponding to the data between 0.30 and 0.8 on the chart, represents a rise in service usage for temperatures between:

  • 0.30*41 = 12.3 °C
  • 0.8*41 = 32.8 °C
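Since each weather feature was normalized by its maximum, the real units are recovered by multiplying back. A small sketch using the max values from the feature description:

```python
# Max values used for normalization, from the feature description above
MAX_TEMP, MAX_ATEMP, MAX_HUM, MAX_WINDSPEED = 41, 50, 100, 67

def denormalize_temp(t_norm):
    """Convert a normalized temperature back to degrees Celsius."""
    return t_norm * MAX_TEMP

# Bounds of the cluster discussed above
print(round(denormalize_temp(0.30), 1))  # 12.3
print(round(denormalize_temp(0.8), 1))   # 32.8
```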
In [15]:
mnth = pd.DataFrame(file)
mask = (mnth['dteday'] > '2012-01-01') & (mnth['dteday']<= '2012-12-31')
mnth = mnth.loc[mask]
mnth = mnth.groupby(mnth.mnth)['cnt'].sum()
mnth = mnth.reset_index()

fig = px.bar(mnth, x='mnth', y='cnt',color='cnt',height=400)
fig.show()

from scipy.stats import normaltest
from scipy.stats import shapiro

stat, p = normaltest(mnth['cnt'])
print('Statistics=%.2f, p=%.2f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
    
stat, p = shapiro(mnth['cnt'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
Statistics=0.48, p=0.79
Sample looks Gaussian (fail to reject H0)
Statistics=0.946, p=0.580
Sample looks Gaussian (fail to reject H0)
/opt/anaconda3/lib/python3.7/site-packages/scipy/stats/stats.py:1450: UserWarning:

kurtosistest only valid for n>=20 ... continuing anyway, n=12

In [16]:
#Z-score
mean = mnth['cnt'].mean()
std = mnth['cnt'].std()
print("mean = ",mean)
print("std = ",std)
mean =  127656.33333333333
std =  18415.937970500807
In [17]:
z1 = (107656-mean)/std 
z2 = (147656-mean)/std 
print("z1 = " ,z1)
print("z2 = " ,z2)
z1 =  -1.0860339215613375
z2 =  1.0859977210339613
In [18]:
print("0.3621*2=",0.3621*2)
print("For the year 2012, 72.4% of bike rentals are between 107656 and 147656 per month")
0.3621*2= 0.7242
For the year 2012, 72.4% of bike rentals are between 107656 and 147656 per month
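Under the normality suggested by the tests above, the share of months within the two bounds follows from the normal CDF. A sketch using only the standard library's error function (the mean and std are the values computed above):

```python
from math import erf, sqrt

mean, std = 127656.33, 18415.94   # monthly mean and std computed above
low, high = 107656, 147656        # the interval of interest

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Probability mass between the two bounds under a normal model
share = normal_cdf(high, mean, std) - normal_cdf(low, mean, std)
print(f"{share:.1%}")  # ≈ 72.3%, matching the rough z-table lookup above
```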

Given these results, I decided to keep the following features:

  • mnth
  • season
  • hr
  • yr
  • weekday
  • workingday
  • weathersit
  • temp
  • windspeed

I did not keep the feature "registered" because, together with "casual", it sums exactly to "cnt", giving a near-perfect correlation with the target. I did not keep the feature "atemp" because its correlation with "temp" is close to 1.

Machine Learning

In [19]:
ml = file[['season','hr','mnth','yr','weekday','workingday','weathersit','windspeed','temp','cnt']]
X = ml.drop(['cnt'], axis=1)
y = ml['cnt']
In [20]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from vecstack import stacking
from sklearn import ensemble
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

I chose to use XGBoost. As a tree-based model, it does not require the data to be normalized.
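That claim can be sanity-checked: a tree splits on thresholds, so rescaling a feature by a positive constant moves the thresholds but leaves the fitted partition, and hence the predictions, unchanged. A small check with scikit-learn's DecisionTreeRegressor standing in for XGBoost, on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 3 * X[:, 0] + rng.normal(0.0, 0.1, size=200)

# Same data with one feature rescaled by a constant positive factor
X_scaled = X.copy()
X_scaled[:, 0] *= 100

pred = DecisionTreeRegressor(random_state=0).fit(X, y).predict(X)
pred_scaled = DecisionTreeRegressor(random_state=0).fit(X_scaled, y).predict(X_scaled)
print(np.allclose(pred, pred_scaled))  # True: thresholds move, predictions do not
```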

XGBRegressor

In [21]:
parameters = {'nthread':[4], # when using hyperthreading, xgboost may become slower
              'objective':['reg:squarederror'],
              'learning_rate': [.01,.03, 0.05, .07, .1], #so called `eta` value
              'max_depth': [5, 6, 7, 9],
              'min_child_weight': [4],
              'silent': [1],
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'n_estimators': [100,500,800,1000]}
In [22]:
xgb1 = XGBRegressor()
xgb_grid = GridSearchCV(xgb1,
                        parameters,
                        cv = 2,
                        n_jobs = -1,
                        verbose=True)

xgb_grid.fit(X_train,y_train)

print(xgb_grid.best_score_)
print(xgb_grid.best_params_)
Fitting 2 folds for each of 80 candidates, totalling 160 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   30.0s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  2.3min finished
0.9242993755190563
{'colsample_bytree': 0.7, 'learning_rate': 0.01, 'max_depth': 9, 'min_child_weight': 4, 'n_estimators': 1000, 'nthread': 4, 'objective': 'reg:squarederror', 'silent': 1, 'subsample': 0.7}
In [23]:
model = XGBRegressor(colsample_bytree= 0.7, learning_rate= 0.01, max_depth= 9, min_child_weight= 4, n_estimators=1000, nthread= 4, objective= 'reg:squarederror', silent=1, subsample= 0.7)
model.fit(X_train, y_train)
print(model.score(X_test,y_test))
print(model.score(X_train,y_train))

from sklearn.metrics import mean_squared_error

ypred = model.predict(X_test)
mse = mean_squared_error(y_test,ypred)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % np.sqrt(mse))
0.9346173094603396
0.9724444626448591
MSE: 1509.23
RMSE: 38.85
In [24]:
from xgboost import plot_importance
plt.rcParams["figure.figsize"] = (14, 14)
plot_importance(model)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1bfade90>
In [25]:
from numpy import sort
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = r2_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBRegressor(colsample_bytree= 0.7, learning_rate= 0.01, max_depth= 9, min_child_weight= 4, n_estimators=1000, nthread= 4, objective= 'reg:squarederror', silent=1, subsample= 0.7)
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = r2_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
Accuracy: 93.46%
Thresh=0.013, n=9, Accuracy: 93.46%
Thresh=0.027, n=8, Accuracy: 93.43%
Thresh=0.032, n=7, Accuracy: 92.23%
Thresh=0.049, n=6, Accuracy: 90.86%
Thresh=0.059, n=5, Accuracy: 87.65%
Thresh=0.063, n=4, Accuracy: 83.14%
Thresh=0.191, n=3, Accuracy: 70.32%
Thresh=0.219, n=2, Accuracy: 50.98%
Thresh=0.347, n=1, Accuracy: 50.35%
In [26]:
X = X.drop(['yr','workingday','season'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

xgb1 = XGBRegressor()
xgb_grid = GridSearchCV(xgb1,
                        parameters,
                        cv = 2,
                        n_jobs = -1,
                        verbose=True)

xgb_grid.fit(X_train,y_train)

print(xgb_grid.best_score_)
print(xgb_grid.best_params_)
Fitting 2 folds for each of 80 candidates, totalling 160 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   22.4s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  2.0min finished
0.8202854180164514
{'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 6, 'min_child_weight': 4, 'n_estimators': 800, 'nthread': 4, 'objective': 'reg:squarederror', 'silent': 1, 'subsample': 0.7}
In [27]:
model = XGBRegressor(colsample_bytree= 0.7, learning_rate= 0.03, max_depth= 6, min_child_weight= 4, n_estimators=800, nthread= 4, objective= 'reg:squarederror', silent=1, subsample= 0.7)
model.fit(X_train, y_train)
print(model.score(X_test,y_test))
print(model.score(X_train,y_train))

from sklearn.metrics import mean_squared_error

ypred = model.predict(X_test)
mse = mean_squared_error(y_test,ypred)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % np.sqrt(mse))
0.8253557779728541
0.8845683376273054
MSE: 4031.32
RMSE: 63.49
In [28]:
hr = []
mnth= []
weekday= []
weathersit= []
windspeed= []
temp= []

for i in range(0,24):
    hr.append(i)
    mnth.append(6)
    weekday.append(6)
    weathersit.append(1)
    windspeed.append(0.2)
    temp.append(0.4)

var = pd.DataFrame()
var['hr'] = hr
var['mnth'] = mnth
var['weekday'] = weekday
var['weathersit'] = weathersit
var['windspeed'] = windspeed
var['temp'] = temp

var
Out[28]:
hr mnth weekday weathersit windspeed temp
0 0 6 6 1 0.2 0.4
1 1 6 6 1 0.2 0.4
2 2 6 6 1 0.2 0.4
3 3 6 6 1 0.2 0.4
4 4 6 6 1 0.2 0.4
5 5 6 6 1 0.2 0.4
6 6 6 6 1 0.2 0.4
7 7 6 6 1 0.2 0.4
8 8 6 6 1 0.2 0.4
9 9 6 6 1 0.2 0.4
10 10 6 6 1 0.2 0.4
11 11 6 6 1 0.2 0.4
12 12 6 6 1 0.2 0.4
13 13 6 6 1 0.2 0.4
14 14 6 6 1 0.2 0.4
15 15 6 6 1 0.2 0.4
16 16 6 6 1 0.2 0.4
17 17 6 6 1 0.2 0.4
18 18 6 6 1 0.2 0.4
19 19 6 6 1 0.2 0.4
20 20 6 6 1 0.2 0.4
21 21 6 6 1 0.2 0.4
22 22 6 6 1 0.2 0.4
23 23 6 6 1 0.2 0.4
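The loop-based construction above can be written more compactly by letting pandas broadcast the scalar columns (same hypothetical scenario: a Saturday in June, clear weather, moderate wind, mild temperature):

```python
import pandas as pd

# Scalars broadcast across the 24 hourly rows
var = pd.DataFrame({
    "hr": range(24),
    "mnth": 6,
    "weekday": 6,
    "weathersit": 1,
    "windspeed": 0.2,
    "temp": 0.4,
})
print(var.shape)  # (24, 6)
```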
In [29]:
prediction = model.predict(var)
print(prediction)
[101.536964   68.16906    47.533768   17.59019     2.8934195   1.1388865
  21.187677   63.02645   167.98474   240.2805    284.7941    327.33487
 352.4724    354.37387   361.5893    353.91556   333.35376   308.66312
 274.1451    209.21059   176.42868   167.20737   147.78569   118.53838  ]
In [30]:
fig = go.Figure()

fig.update_layout(
    title="Prediction de la location des vélos par heure",
    xaxis_title="Heure",
    yaxis_title="Nbr de locations",
)

fig.add_trace(go.Scatter(
    x=var['hr'],
    y=prediction,
    name="Prediction"
))

fig.show()

My initial idea was to use stacking to build the ML model. However, I realized that a single boosting model was enough, provided the features were chosen carefully beforehand and the outliers removed. Boosting reduces the model's bias; as for bagging, which reduces variance, I had already taken care of removing outliers during data preparation.

To find the best hyper-parameters, I used GridSearchCV to apply cross-validation. I also examined the correlations between the target feature "cnt" and the others, in order to avoid overfitting with "casual" and "registered". Finally, I used feature importance to optimize the model as much as possible, both in execution speed and in accuracy.

The primary goal was to find the dataset's patterns and correlations, remove aberrant data, and understand the data, all in order to build the model under the best possible conditions.

It could be interesting to understand what kind of person subscribes to the service and who prefers to remain a casual user, so as to segment the customer base and target people still reluctant to subscribe. For that, we would need additional features about the users, such as age, location, trips, etc.